We wish to congratulate the authors for their innovative contribution, 
which is bound to inspire much further research. We find latent variable 
model selection to be a fantastic application of matrix decomposition meth- 
ods, namely, the superposition of low-rank and sparse elements. Clearly, the 
methodology introduced in this paper is of potential interest across many 
disciplines. In the following, we will first discuss this paper in more detail 
and then reflect on the versatility of the low-rank + sparse decomposition. 

Latent variable model selection. The proposed scheme is an extension of 
the graphical lasso of Yuan and Lin [15] (see also [1, 6]), which is a popular 
approach for learning the structure in an undirected Gaussian graphical 
model. In this setup, we assume we have independent samples $X \sim \mathcal{N}(0, \Sigma)$
with a covariance matrix $\Sigma$ exhibiting a sparse dependence structure but
otherwise unknown; that is to say, most pairs of variables are conditionally
independent given all the others. Formally, the concentration matrix $\Sigma^{-1}$ is
assumed to be sparse. A natural fitting procedure is then to regularize the
likelihood by adding a term proportional to the $\ell_1$ norm of the estimated
inverse covariance matrix $S$:

(1) minimize $-\ell(S, \Sigma_O^n) + \lambda \|S\|_1$

under the constraint $S \succeq 0$, where $\Sigma_O^n$ is the empirical covariance matrix and
$\|S\|_1 = \sum_{ij} |S_{ij}|$. (Variants are possible depending upon whether or not one
would want to penalize the diagonal elements.) This problem is convex.
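
To make (1) concrete, here is a minimal sketch that evaluates the penalized objective for a candidate $S$. It assumes the usual Gaussian log-likelihood, $\ell(S, \Sigma) = \log\det S - \operatorname{trace}(S\Sigma)$ up to constants and scaling, which is not spelled out above; the function name and the `penalize_diagonal` flag are illustrative only.

```python
import numpy as np

def graphical_lasso_objective(S, Sigma_n, lam, penalize_diagonal=True):
    """Penalized negative Gaussian log-likelihood, up to additive constants.

    S        : candidate concentration (inverse covariance) matrix, symmetric
    Sigma_n  : empirical covariance matrix
    lam      : weight of the l1 penalty
    """
    try:
        # Cholesky succeeds only if S is positive definite.
        chol = np.linalg.cholesky(S)
    except np.linalg.LinAlgError:
        return np.inf  # outside the feasible cone
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    neg_log_lik = -log_det + np.trace(S @ Sigma_n)
    abs_S = np.abs(S)
    l1 = abs_S.sum()
    if not penalize_diagonal:
        # Variant that leaves the diagonal entries unpenalized.
        l1 -= np.trace(abs_S)
    return neg_log_lik + lam * l1
```

Both terms are convex in $S$ on the positive definite cone, which is why a generic convex solver can be applied to (1).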

When some variables are unobserved — the observed and hidden variables 
are still jointly Gaussian — the model above may not be appropriate because 
the hidden variables can have a confounding effect. An example is this: 
we observe stock prices of companies and would like to infer conditional 
(in)dependence. Suppose, however, that all these companies rely on a commodity,
a source of energy, for instance, which is not observed. Then the 
stock prices might appear dependent even though they may not be once we 
condition on the price of this commodity. In fact, the marginal inverse co- 
variance of the observed variables decomposes into two terms. The first is the 
concentration matrix of the observed variables in the full model conditioned 
on the latent variables. The second term is the effect of marginalization over 
the hidden variables. Assuming a sparse graphical model, the first term is 
sparse, whereas the second term may have low rank; in particular, the rank 
is at most the number of hidden variables. The authors then penalize the 
negative log-likelihood with a term proportional to 

(2) $\gamma \|S\|_1 + \operatorname{trace}(L)$

since the trace functional is the usual convex surrogate for the rank over the
cone of positive semidefinite matrices. The constraints are $S \succ L \succeq 0$.
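
The two-term decomposition described above can be checked numerically. The sketch below builds a small joint concentration matrix over observed and hidden variables (the sizes and entries are arbitrary illustrative choices, not taken from the paper) and verifies the standard Schur complement identity: the marginal concentration matrix of the observed variables equals the observed block minus a term of rank at most the number of hidden variables.

```python
import numpy as np

p, h = 6, 2  # numbers of observed and hidden variables (illustrative)

# Joint concentration matrix K over (observed, hidden) variables.
# The observed block is tridiagonal, i.e. a sparse conditional graph.
K = np.zeros((p + h, p + h))
K[:p, :p] = 2 * np.eye(p) + 0.3 * (np.eye(p, k=1) + np.eye(p, k=-1))
K[p:, p:] = 2 * np.eye(h)
K_OH = 0.3 * np.column_stack([np.ones(p), np.linspace(-1.0, 1.0, p)])
K[:p, p:] = K_OH
K[p:, :p] = K_OH.T  # K is strictly diagonally dominant, hence positive definite

# Marginalize over the hidden variables, then invert the observed covariance.
Sigma = np.linalg.inv(K)
marginal_concentration = np.linalg.inv(Sigma[:p, :p])

sparse_term = K[:p, :p]                                      # conditioned on the hidden variables
low_rank_term = K_OH @ np.linalg.solve(K[p:, p:], K_OH.T)    # effect of marginalization

assert np.allclose(marginal_concentration, sparse_term - low_rank_term)
rank_of_low_rank_term = np.linalg.matrix_rank(low_rank_term)
```

In this toy example the marginal concentration matrix has many more nonzero entries than the tridiagonal observed block, which is the confounding effect the text describes.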

Adaptivity. The penalty (2) is simple and flexible since it does not really 
make special parametric assumptions. To be truly appealing, it would also 
need to be adaptive in the following sense: suppose there is no hidden vari- 
able, then does the low-rank + sparse model (L + S) behave as well or nearly 
as well as the graphical lasso? When there are few hidden variables, does it 
behave nearly as well? Are there such theoretical guarantees? If this is the 
case, it would say that using the L + S model would protect against the dan- 
ger of not having accounted for all possible covariates. At the same time, if 
there were no hidden variable, one would not suffer any loss of performance. 
Thus, we would get the best of both worlds. 

At first sight, the analysis presented in this paper does not allow us to 
reach this conclusion. If $X$ is $p$-dimensional, the number of samples needed to
show that one can obtain accurate estimates scales like $\Omega(p/\xi^4)$, where $\xi$ is a
modulus of continuity introduced in the paper that is typically much smaller
than 1. We can think of $1/\xi$ as being related to the maximum degree $d$ of
the graph so that the condition may be interpreted as having a number of
observations very roughly scaling like $d^4 p$. In addition, accurate estimation
holds with the proviso that the signal is strong enough; here, both the minimum
nonzero singular value of the low-rank component and the minimum
nonzero entry of the sparse component scale like $\Omega(\sqrt{p/n})$. On the other
hand, when there are no hidden variables, a line of work [11, 13, 14] has established
that we could estimate the concentration matrix with essentially
the same accuracy if $n = O(d^2 \log p)$ and the magnitude of the minimum
nonvanishing value of the concentration matrix scales like $\Omega(\sqrt{n^{-1} \log p})$.
As before, d is the maximum degree of the graphical model. In the high- 
dimensional regime, the results offered by this literature seem considerably 
better. It would be interesting to know whether this gap could be bridged, and
if so, under what types of conditions.
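
To give a feel for the size of the gap between the two regimes, the following sketch compares the heuristic scalings $d^4 p$ and $d^2 \log p$ for a few values of $p$ and $d$. Constants are ignored and $1/\xi$ is identified with $d$, as in the rough interpretation above, so the numbers are orders of magnitude only; the function names are hypothetical.

```python
import math

def samples_latent(p, d):
    """Heuristic d^4 * p scaling for the L + S analysis (constants ignored)."""
    return d ** 4 * p

def samples_graphical_lasso(p, d):
    """Heuristic d^2 * log(p) scaling without hidden variables (constants ignored)."""
    return d ** 2 * math.log(p)

for p, d in [(100, 3), (1000, 5), (10000, 10)]:
    ratio = samples_latent(p, d) / samples_graphical_lasso(p, d)
    print(f"p={p:>6}, d={d:>2}: d^4 p = {samples_latent(p, d):.3g}, "
          f"d^2 log p = {samples_graphical_lasso(p, d):.3g}, ratio = {ratio:.3g}")
```

The ratio equals $d^2 p / \log p$, so it grows with both the dimension and the maximum degree.
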

Interestingly, such adaptivity properties have been established for related
problems. For instance, the L + S model has been used to suggest
the possibility of a principled approach to robust principal component analysis [2].
Suppose we have incomplete and corrupted information about an
$n_1 \times n_2$ low-rank matrix $L^0$. More precisely, we observe $M_{ij} = L^0_{ij} + S^0_{ij}$,
where $(i, j) \in \Omega_{\mathrm{obs}} \subset \{1, \ldots, n_1\} \times \{1, \ldots, n_2\}$. We think of $S^0$ as a corruption
pattern so that some entries are totally unreliable but we do not know
which ones. Then [2] shows that under rather broad conditions, the solution
to 

(3) minimize   $\|L\|_* + \lambda \|S\|_1$
    subject to $M_{ij} = L_{ij} + S_{ij}$, $(i, j) \in \Omega_{\mathrm{obs}}$,

where $\|L\|_*$ is the nuclear norm, recovers $L^0$ exactly. Now suppose there
are no corruptions. Then we are facing a matrix completion problem and,
instead, one would want to minimize the nuclear norm of $L$ under data
constraints. In other words, there is no need for $S$ in (3). The point is that
there is a fairly precise understanding of the minimal number of samples
needed for this strategy to work; for incoherent matrices [3], $|\Omega_{\mathrm{obs}}|$ must
scale like $(n_1 \vee n_2)\, r \log^2 n$, where $r$ is the rank of $L^0$. Now some recent
work [10] establishes the adaptivity in question. In detail, (3) recovers $L^0$
from a minimal number of samples, in the sense defined above, even though 
a positive fraction may be corrupted. That is, the number of reliable samples 
one needs, regardless of whether corruption occurs, is essentially the same. 
Results of this kind extend to other settings as well. For instance, in sparse 
regression or compressive sensing we seek a sparse solution to $y = Xb$ by
minimizing the $\ell_1$ norm of $b$. Again, we may be worried that some equations
are unreliable because of gross errors and would solve, instead,

(4) minimize   $\|b\|_1 + \lambda \|e\|_1$
    subject to $y = Xb + e$

to achieve robustness. Here, [10] shows that the minimal number of reli- 
able samples/equations required, regardless of whether the data is clean or 
corrupted, is essentially the same. 
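
As an illustration of problem (3), here is a minimal sketch for the special case in which every entry is observed, so that the constraint reads $M = L + S$. It uses a standard augmented Lagrangian (ADMM-style) iteration that alternates singular value thresholding and entrywise soft thresholding. The default values of `lam` and `mu` are common choices in the robust PCA literature and are assumptions here, as are the function names; this is not presented as the algorithm of [2] or [10].

```python
import numpy as np

def soft_threshold(X, tau):
    """Entrywise shrinkage: proximal map of tau times the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Shrink the singular values: proximal map of tau times the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def principal_component_pursuit(M, lam=None, mu=None, tol=1e-7, max_iter=2000):
    """Minimize ||L||_* + lam * ||S||_1 subject to L + S = M (all entries observed)."""
    n1, n2 = M.shape
    lam = 1.0 / np.sqrt(max(n1, n2)) if lam is None else lam
    mu = n1 * n2 / (4.0 * np.abs(M).sum()) if mu is None else mu
    S = np.zeros_like(M)
    Y = np.zeros_like(M)  # Lagrange multiplier for the constraint L + S = M
    for _ in range(max_iter):
        L = singular_value_threshold(M - S + Y / mu, 1.0 / mu)
        S = soft_threshold(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S

# Synthetic check: rank-2 matrix with about 5% grossly corrupted entries.
rng = np.random.default_rng(0)
n, r = 60, 2
L0 = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
corrupted = rng.random((n, n)) < 0.05
S0 = corrupted * rng.choice([-5.0, 5.0], size=(n, n))
L_hat, S_hat = principal_component_pursuit(L0 + S0)
rel_err = np.linalg.norm(L_hat - L0) / np.linalg.norm(L0)
```

Handling a strict subset $\Omega_{\mathrm{obs}}$ of observed entries requires restricting the constraint to those entries, which this sketch does not attempt.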

The versatility of the L + S model. We now move to discuss the L + S 
model more generally and survey a set of circumstances where it has proven 
useful and powerful. To begin with, methods which simply minimize an $\ell_1$
norm, or a nuclear norm, or a combination thereof are seductive because they 
are flexible and apply to a rich class of problems. The L + S model is non- 
parametric and does not make many assumptions. As a result, it is widely 
applicable to problems ranging from latent variable model selection [4] (arguably
one of the most subtle and beautiful applications of this method) 
to video surveillance in computer vision and document classification in ma- 
chine learning [2]. In any given application, when much is known about the 
problem, it may not return the best possible answer, but our experience 
is that it is always fairly competitive. That is, the little performance loss
we might encounter is more than offset by the robustness we gain
vis-à-vis various modeling assumptions, which may or may not hold in real
applications. A few recent applications of the L + S model demonstrate its 
flexibility and robustness. 

Applications in computer vision. The L + S model has been applied to 
address several problems in computer vision, most notably by the group of 
Yi Ma and colleagues. Although the low-rank + sparse model may not hold 
precisely, the nuclear + $\ell_1$ relaxation appears practically robust. This may
be in contrast with algorithms which use detailed modeling assumptions and 
may not perform well under slight model mismatch or variation. 

Video surveillance. An important task in computer vision is to separate 
background from foreground. Suppose we stack a sequence of video frames 
as columns of a matrix (rows are pixels and columns time points), then it is 
not hard to imagine that the background will have low rank since it is not
changing very much over time, while the foreground objects, such as
pedestrians and so on, can be seen as a sparse disturbance. Hence, finding
an L + S decomposition offers a new way of modeling the background (and 
foreground). This method has been applied with some success [2]; see also 
the online videos Video 1 and Video 2. 
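
The stacking construction described above can be sketched on synthetic data. In the toy example below (frame size, number of frames and the moving 2 x 2 block are all made-up illustrative choices), each frame is vectorized into a column; a perfectly static background then gives a rank-1 matrix, and the moving object touches only a small fraction of the entries.

```python
import numpy as np

height, width, n_frames = 12, 12, 20
rng = np.random.default_rng(0)
background = rng.random((height, width))  # static scene, identical in every frame

foreground_frames = []
for t in range(n_frames):
    fg = np.zeros((height, width))
    pos = t % (height - 1)              # a 2 x 2 "pedestrian" moving along the diagonal
    fg[pos:pos + 2, pos:pos + 2] = 1.0
    foreground_frames.append(fg)

# Rows are pixels, columns are time points.
L = np.column_stack([background.ravel()] * n_frames)
S = np.column_stack([fg.ravel() for fg in foreground_frames])
M = L + S  # the observed video matrix

background_rank = np.linalg.matrix_rank(L)
foreground_fraction = np.count_nonzero(S) / S.size
```

In real footage the background changes slowly (illumination, small camera motion), so one would expect approximately low rank rather than rank exactly 1.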

From textures to 3D. One of the most fundamental steps in computer 
vision consists of extracting relevant features that are subsequently used for 
high-level vision applications such as 3D reconstruction, object recognition 
and scene understanding. There has been limited success in extracting stable 
features across variations in lighting, rotations and viewpoints. Partial 
occlusions further complicate matters. For certain classes of 3D objects such 
as images with regular symmetric patterns/textures, one can bypass the 
extraction of local features to recover 3D structure from 2D views. To fix 
ideas, a vertical or horizontal strip can be regarded as a rank-1 texture and 
a corner as a rank-2 texture. Generally speaking, surfaces may exhibit a low- 
rank texture when seen from a suitable viewpoint; see Figure 1. However, 
their 2D projections as captured by a camera will typically not be low rank. 
To see why, imagine there is a low-rank texture $L^0(x, y)$ on a planar surface.
The image we observe is a transformed version of this texture, namely,
$L^0 \circ \tau^{-1}(x, y)$. A technique named TILT [16] recovers $\tau$ simply by seeking a
low-rank and sparse superposition. In spite of idealized assumptions, Figures 1
and 2 show that the L + S model works well in practice.

Compressive acquisition. In the spirit of compressive sensing, the L + S 
model can also be used to speed up the acquisition of large data sets or lower 
the sampling rate. At the moment, the theory of compressive sensing relies
on the sparsity of the object we wish to acquire; however, in some setups
the L + S model may be more appropriate. To explain our ideas, it might
be best to start with two concrete examples. Suppose we are interested in 
the efficient acquisition of either (1) a hyper-spectral image or (2) a video 
sequence. In both cases, the object of interest is a data matrix $M$ which is
$N \times d$, where each column is an $N$-pixel image and each of the $d$ columns
corresponds to a specific wavelength (as in the hyper-spectral example) or
frame (or time point as in the video example). In the first case, the data
matrix may be thought of as $M(x, \lambda)$, where $x$ indexes position and $\lambda$ wavelength,
whereas in the second example, we have $M(x, t)$ where $t$ is a time
index. We would like to obtain a sequence of highly resolved images from 
just a few measurements; an important application concerns dynamic magnetic
resonance imaging, where it is only possible to acquire a few samples
in $k$-space per time interval.

Fig. 2. We are given the 16 images on the right. The task is to remove the clutter
and align the images. Stacking each image as a column of a matrix, we look for planar
homographies that reveal a low-rank plus sparse structure [12]. From left to right: original
data set, aligned images, low-rank component (columns of L), sparse component (columns
of S).

Clearly, frames in a video sequence are highly correlated in time. And 
in just the same way, two images of the same scene at nearby wavelengths 
are also highly correlated. Obviously, images are correlated in space as well. 
Suppose that $W \otimes F$ is a tensor basis, where $W$ sparsifies images and $F$ time
traces ($W$ might be a wavelet transform and $F$ a Fourier transform). Then
we would expect $WMF$ to be a nearly sparse matrix. With undersampled
data of the form $y = \mathcal{A}(M) + z$, where $\mathcal{A}$ is the operator supplying information
about $M$ and $z$ is a noise term, this leads to the low-rank + sparse
decomposition problem

(5) minimize   $\|X\|_* + \lambda \|WXF\|_1$
    subject to $\|\mathcal{A}(X) - y\|_2 \le \varepsilon$,

where $\varepsilon^2$ is the noise power. A variation, which is more in line with the
discussion paper, is a model in which $L$ is a low-rank matrix modeling the
static background, and S is a sparse matrix roughly modeling the innovation 
from one frame to the next; for instance, S might encode the moving objects 
in the foreground. This would give 

(6) minimize   $\lambda \|L\|_* + \|WSF\|_1$
    subject to $\|\mathcal{A}(L + S) - y\|_2 \le \varepsilon$.

One could imagine that these models might be useful in alleviating the 
tremendous burden on system resources in the acquisition of ever larger 3D, 
4D and 5D data sets. 

We note that proposals of this kind have begun to emerge. As we were 
preparing this commentary, we became aware of [8], which suggests a model 
similar to (5) for hyperspectral imaging. The difference is that the second
term in (5) is of the form $\sum_i \|X_i\|_{\mathrm{TV}}$, in which $X_i$ is the $i$th column of $X$,
the image at wavelength $\lambda_i$; that is, we minimize the total variation of each
image, instead of looking for sparsity simultaneously in space and wavelength/frequency.
The results in [8] show that dramatic undersampling ratios
are possible. In medical imaging, movement due to respiration can degrade 
the image quality of Computed Tomography (CT), which can lead to incor- 
rect dosage in radiation therapy. Using time-stamped data, 4D CT has more 
potential for precise imaging. Here, one can think of the object as a matrix 
with rows labeling spatial variables and columns time. In this context, we 
have a low-rank (static) background and a sparse disturbance corresponding 
to the dynamics, for example, of the heart in cardiac imaging. The recent 
work [7] shows how one can use the L + S model in a fashion similar to (6). 
This has interesting potential for dose reduction since the approach also 
supports substantial undersampling. 
Connections with theoretical computer science and future directions. 

A class of problems where further study is required concerns situations in 
which the low-rank and sparse components have a particular structure. One 
such problem is the planted clique problem. It is well known that finding the 
largest clique in a graph is NP-hard; in fact, it is even NP-hard to approximate
the size of the largest clique in an $n$-vertex graph to within a factor
$n^{1-\epsilon}$. Therefore, much research has focused on an "easier" problem. Consider
a random graph $G(n, 1/2)$ on $n$ vertices where each edge is selected
independently with probability 1/2. The expected size of its largest clique is
known to be $(2 - o(1)) \log n$. The planted clique problem adds a clique of size
$k$ to $G$. One hopes that it is possible to find the planted clique in polynomial
time whenever $k \gg \log n$. At this time, this is only known to be possible if
$k$ is on the order of $\sqrt{n}$ or larger. In spite of its seemingly simple formulation,
this problem has eluded theoretical computer scientists since 1998,
and is regarded as a notoriously difficult problem in modern combinatorics. 
It is also fundamental to many areas in machine learning and pattern recog- 
nition. To emphasize its wide applicability, we mention a new connection 
with game theory. Roughly speaking, the recent work [9] shows that finding 
a near-optimal Nash equilibrium in two-player games is as hard as finding 
hidden cliques of size $k = C_0 \log n$, where $C_0$ is some universal constant.

One can think about the planted clique as a low rank + sparse decom- 
position problem. To be sure, the adjacency matrix of the graph can be 
written as the sum of two matrices: the low-rank component is of rank 1 
and represents the clique of size k (a submatrix with all entries equal to 1); 
the sparse component stands for the random edges (and with $-1$ on the diagonal
if and only if that vertex belongs to the hidden clique). Interestingly,
low-rank + sparse regularization based on nuclear and $\ell_1$ norms has been
applied to this problem [5]. (Here the clique is both low-rank and sparse
and is the object of interest so that we minimize $\|X\|_* + \lambda \|X\|_1$ subject to
data constraints.) The proofs show that these methods find cliques of size
$\Omega(\sqrt{n})$, thus recovering the best known results, but they may not be able
to break this barrier. It would be interesting to investigate whether tighter relaxations,
taking into account the specific structure of the low-rank and sparse
components, can do better.
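
The decomposition of the adjacency matrix described above can be written out explicitly. The sketch below generates a graph from $G(n, 1/2)$, plants a clique of size $k$, and splits the adjacency matrix into the rank-1 indicator block and the remainder; the function name and the values of $n$ and $k$ are illustrative assumptions. It only constructs the decomposition and does not attempt to recover the clique.

```python
import numpy as np

def planted_clique_adjacency(n, k, rng):
    """Adjacency matrix of G(n, 1/2) with a clique planted on k random vertices."""
    upper = np.triu(rng.random((n, n)) < 0.5, k=1)
    A = (upper | upper.T).astype(int)      # symmetric, zero diagonal
    clique = rng.choice(n, size=k, replace=False)
    A[np.ix_(clique, clique)] = 1          # connect every pair inside the clique
    A[clique, clique] = 0                  # no self-loops
    return A, clique

rng = np.random.default_rng(0)
n, k = 50, 8
A, clique = planted_clique_adjacency(n, k, rng)

indicator = np.zeros(n)
indicator[clique] = 1.0
L = np.outer(indicator, indicator)   # rank-1 block of ones on the clique
S = A - L                            # random edges, with -1 on the clique's diagonal

rank_L = np.linalg.matrix_rank(L)
clique_block = S[np.ix_(clique, clique)]
```

Inside the clique block, $S$ vanishes off the diagonal and equals $-1$ on it; everywhere else $S$ coincides with the random edges of the graph.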